A Theory and Toolkit for the Mathematics of Privacy: Methods for Anonymizing Data while Minimizing Information Loss
Authors
Abstract
Privacy laws are an important facet of our society, but they can also be formidable barriers to medical research. The same laws that prevent casual disclosure of medical data have made it difficult for researchers to access the information they need to study the causes of disease. Technology can overcome some of these legal barriers. The US law known as HIPAA, for example, allows medical records to be released to researchers without patient consent if the records are provably anonymized before disclosure. It is not enough for records to merely seem anonymous: one researcher estimates that 87.1% of the US population can be uniquely identified by the combination of ZIP code, gender, and date of birth, fields that most people would consider anonymous. One promising technique for provably anonymizing records is called k-anonymity. It modifies each record so that it matches k other individuals in a population, where k is an arbitrary parameter. This is achieved by, for example, replacing specific information such as a date of birth with a less specific counterpart such as a year of birth. Previous studies have shown that achieving k-anonymity while minimizing information loss is an NP-hard problem; thus a brute-force search is out of the question for most real-world data sets. In this thesis, we present an open-source Java toolkit that seeks to anonymize data while minimizing information loss. It uses an optimization framework and methods typically used to attack NP-hard problems, including greedy search and clustering strategies. To test the toolkit, a number of previously unpublished algorithms and information loss metrics have been implemented. These algorithms and measures are then empirically evaluated on a data set of 1000 real patient medical records taken from a local hospital.
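The generalization step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's Java toolkit or its API. The records, function names, and the choice of quasi-identifiers (ZIP code, gender, date of birth) are hypothetical and invented for illustration. The check assumes the common group-size formulation of k-anonymity, in which every combination of quasi-identifier values must occur at least k times in the released data.

```python
from collections import Counter
from datetime import date

# Hypothetical records: (ZIP code, gender, date of birth). Invented for illustration.
RECORDS = [
    ("02139", "F", date(1980, 3, 14)),
    ("02139", "F", date(1980, 11, 2)),
    ("02139", "M", date(1975, 6, 30)),
    ("02139", "M", date(1975, 1, 9)),
    ("02141", "F", date(1990, 7, 21)),
    ("02141", "F", date(1990, 12, 5)),
]


def generalize_dob_to_year(record):
    """Replace the full date of birth with the year of birth only."""
    zip_code, gender, dob = record
    return (zip_code, gender, dob.year)


def is_k_anonymous(records, k):
    """True if every combination of values occurs at least k times.

    Assumes the group-size formulation of k-anonymity; an empty data set
    is treated as trivially k-anonymous.
    """
    if not records:
        return True
    return min(Counter(records).values()) >= k


def fraction_unique(records):
    """Share of records whose combination of values appears exactly once."""
    counts = Counter(records)
    return sum(1 for r in records if counts[r] == 1) / len(records)


# Generalizing the date of birth merges records into larger groups.
generalized = [generalize_dob_to_year(r) for r in RECORDS]
```

On this toy data every raw record is unique, so `is_k_anonymous(RECORDS, 2)` is false, while `is_k_anonymous(generalized, 2)` is true after dates are coarsened to years. The sketch applies one fixed generalization and does not attempt the harder problem the thesis addresses, which is choosing generalizations that reach k-anonymity with minimal information loss.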
Similar resources
M-Partition Privacy Scheme to Anonymizing Set-Valued Data
In distributed databases there is an increasing need for sharing data that contain personal information. The existing system presented collaborative data publishing problem for anonymizing horizontally partitioned data at multiple data providers. M-privacy guarantees that anonymized data satisfies a given privacy constraint against any group of up to m colluding data providers. The heuristic al...
A Lightweight Privacy-preserving Authenticated Key Exchange Scheme for Smart Grid Communications
Smart grid concept is introduced to modify the power grid by utilizing new information and communication technology. Smart grid needs live power consumption monitoring to provide required services and for this issue, bi-directional communication is essential. Security and privacy are the most important requirements that should be provided in the communication. Because of the complex design of s...
Minimizing Loss of Information at Competitive PLIP Algorithms for Image Segmentation with Noisy Back Ground
In this paper, two training systems for selecting PLIP parameters have been demonstrated. The first compares the MSE of a high precision result to that of a lower precision approximation in order to minimize loss of information. The second uses EMEE scores to maximize visual appeal and further reduce information loss. It was shown that, in the general case of basic addition, subtraction, or mul...
A centralized privacy-preserving framework for online social networks
There are some critical privacy concerns in the current online social networks (OSNs). Users' information is disclosed to different entities that they were not supposed to access. Furthermore, the notion of friendship is inadequate in OSNs since the degree of social relationships between users dynamically changes over the time. Additionally, users may define similar privacy settings for their f...
Improved Univariate Microaggregation for Integer Values
Privacy issues during data publishing are an increasing concern for the entities involved. The problem is addressed in the field of statistical disclosure control with the aim of producing protected datasets that are also useful for interested end users such as government agencies and research communities. The problem of producing useful protected datasets is addressed in multiple computational priva...